NSF PAR Search | NSF Public Access Repository

CALVIN: Improved Contextual Video Captioning via Instruction Tuning

https://doi.org/10.52202/079017-2952

Somepalli, Gowthami; Chowdhury, Arkabandhu; Basri, Ronen; Geiping, Jonas; Goldstein, Tom; Jacobs, David (December 2024, Neural Information Processing Systems Foundation, Inc. (NeurIPS))

The recent emergence of powerful Vision-Language models (VLMs) has significantly improved image captioning. Some of these models are extended to caption videos as well. However, their capabilities to understand complex scenes are limited, and the descriptions they provide for scenes tend to be overly verbose and focused on the superficial appearance of objects. Scene descriptions, especially in movies, require a deeper contextual understanding unlike general-purpose video captioning. To address this challenge, we propose a model, CALVIN, a specialized video LLM that leverages previous movie context to generate fully “contextual” scene descriptions. To achieve this, we train our model on a suite of tasks that integrate both image-based question-answering and video captioning within a unified framework, before applying instruction tuning to refine the model’s ability to provide scene captions. Lastly, we observe that our model responds well to prompt engineering and few-shot in-context learning techniques, enabling the user to adapt it to any new movie with very little additional annotation.
Full Text Available
CALVIN: Improved Contextual Video Captioning via Instruction Tuning

Somepalli, Gowthami; Chowdhury, Arkabandhu; Geiping, Jonas; Basri, Ronen; Goldstein, Tom; Jacobs, David W (November 2024, Advances in Neural Information Processing Systems)

Full Text Available
From Pixels to Prose: A Large Dataset of Dense Image Captions

Singla, Vasu; Yue, Kaiyu; Paul, Sukriti; Shirkavand, Reza; Jayawardhana, Mayuka; Ganjdanesh, Alireza; Huang, Heng; Bhatele, Abhinav; Somepalli, Gowthami; Goldstein, Tom (June 2024, ArXiv)

Full Text Available
Understanding and Mitigating Copying in Diffusion Models

Somepalli, Gowthami; Singla, Vasu; Goldblum, Micah; Geiping, Jonas; Goldstein, Tom (December 2023, NeurIPS 2023)

This paper proposes solutions to detecting and mitigating the blatant replication and memorization of data used to train text-to-image generators, especially Stable Diffusion. The potential for diffusion models to reproduce copyrighted or private images without user knowledge poses significant ethical and legal challenges. For lawmakers, this highlights the need for clear guidelines and regulations around the use of such models, especially in commercial applications.
Full Text Available
How much Data is Augmentation Worth?

Geiping, Jonas; Goldblum, Micah; Somepalli, Gowthami; Shwartz-Ziv, Ravid; Goldstein, Tom; Gordon-Wilson, Andrew (July 2022, ICML Workshop on Spurious Correlations, Invariance and Stability)

Full Text Available
Can Neural Nets Learn the Same Model Twice? Investigating Reproducibility and Double Descent from the Decision Boundary Perspective

https://doi.org/10.1109/CVPR52688.2022.01333

Somepalli, Gowthami; Fowl, Liam; Bansal, Arpit; Yeh-Chiang, Ping; Dar, Yehuda; Baraniuk, Richard; Goldblum, Micah; Goldstein, Tom (June 2022, CVPR)

Full Text Available
